Skip to content

🐛 improve early request collection reliability in Firefox#4573

Merged
BenoitZugmeyer merged 2 commits into
v7from
benoit/debug-no-bs-firefox
May 11, 2026
Merged

🐛 improve early request collection reliability in Firefox#4573
BenoitZugmeyer merged 2 commits into
v7from
benoit/debug-no-bs-firefox

Conversation

@BenoitZugmeyer
Copy link
Copy Markdown
Member

Motivation

The microfrontend e2e test microfrontend > RUM > with beforeSend > expose handling stack for fetch requests › async fails ~70% of the time on firefox-pinned (10 failed / 6 flaky / 4 passed out of 20 reproductions). The same race affects any fetch resource event in production: event.context.handlingStack (and the rest of the request-derived domain context — requestInit, response, error, …) is intermittently undefined for fetch resources on Firefox.

Root cause

When a fetch resolves, two things happen around the same time:

  • the browser dispatches a PerformanceResourceTiming entry to our observer
  • afterSend() resumes from await responsePromise and notifies REQUEST_COMPLETED

On Firefox the observer task can fire before the resolve microtask runs, so the request isn't in the requestRegistry yet when assembleResource() looks it up via requestRegistry.getMatchingRequest(). The lookup returns undefined and the resulting resource event is missing all request-derived domain context.

This was introduced by #4285 always collect early requests (commit 2b71062e7), which switched from request-driven (performance.getEntriesByName() at REQUEST_COMPLETED time) to entry-driven matching.

CI logs of the failing case make the ordering obvious:

[failure run]
[DEBUG resourceCollection assembleResource]
  entryStartTime:90, matchedRequest:null, registrySnapshot:[]   ← runs first, registry empty
[DEBUG requestRegistry REQUEST_COMPLETED]
  url:.../ok, startRel:89, hasHandlingStack:true, nowRel:121     ← arrives too late

Changes

  • startResourceCollection: for request-type entries (fetch / xmlhttprequest initiator), defer the registry lookup by REQUEST_MATCHING_DELAY (50 ms) via setTimeout. In practice the matching REQUEST_COMPLETED arrives within ~30 ms even on slow CI runners, so 50 ms is a generous bound.
  • assembleResource now takes the resolved request directly instead of pulling it from the registry — the matching policy is owned by the call site where the timing context is clear.
  • requestRegistry is unchanged; getMatchingRequest is still synchronous.
  • Unit tests drive the new delay through mockClock.tick(REQUEST_MATCHING_DELAY) inside runTasks.

Test instructions

Repro on the e2e suite (Firefox is the most reliable, but the race is browser-agnostic):

yarn playwright test --config test/e2e/playwright.config.ts --project firefox-pinned \
  --repeat-each=20 \
  -g 'microfrontend RUM with beforeSend expose handling stack for fetch requests async'

CI validation (e2e job scoped to the failing test, --repeat-each=20):

pipeline chromium firefox-pinned webkit-pinned
before fix 20/20 4/20 (10 failed, 6 flaky) 20/20
after fix 20/20 20/20 20/20

Checklist

  • Tested locally
  • Tested on staging
  • Added unit tests for this change.
  • Added e2e/integration tests for this change.
  • Updated documentation and/or relevant AGENTS.md file

@BenoitZugmeyer BenoitZugmeyer changed the base branch from main to thomas.lebeau/no-bs May 7, 2026 16:35
@BenoitZugmeyer BenoitZugmeyer force-pushed the benoit/debug-no-bs-firefox branch from a3581e4 to 8558f01 Compare May 7, 2026 16:36
When a fetch resolves, two things happen around the same time:
- the browser dispatches a PerformanceResourceTiming entry to our observer
- afterSend() resumes from `await responsePromise` and notifies REQUEST_COMPLETED

On Firefox the observer task can fire before the resolve microtask runs, so
the request isn't in the registry yet when assembleResource() looks it up,
and the resource event ends up without handlingStack / requestInit / etc.

Defer the registry lookup for request-type entries by REQUEST_MATCHING_DELAY
(50ms) so the matching REQUEST_COMPLETED has time to land. assembleResource
now takes the resolved request directly instead of pulling it from the
registry. Tests drive the delay through mockClock.tick().
@BenoitZugmeyer BenoitZugmeyer force-pushed the benoit/debug-no-bs-firefox branch from 8558f01 to b645dab Compare May 7, 2026 16:36
@BenoitZugmeyer BenoitZugmeyer changed the title 🐛 fix race between PerformanceObserver and REQUEST_COMPLETED on Firefox 🐛 improve early request collection reliability in Firefox May 7, 2026
@BenoitZugmeyer BenoitZugmeyer changed the base branch from thomas.lebeau/no-bs to v7 May 7, 2026 16:38
@cit-pr-commenter-54b7da
Copy link
Copy Markdown

cit-pr-commenter-54b7da Bot commented May 7, 2026

Bundles Sizes Evolution

📦 Bundle Name Base Size Local Size 𝚫 𝚫% Status
Rum 169.22 KiB 169.26 KiB +39 B +0.02%
Rum Profiler 6.10 KiB 6.10 KiB 0 B 0.00%
Rum Recorder 21.23 KiB 21.23 KiB 0 B 0.00%
Logs 54.56 KiB 54.56 KiB 0 B 0.00%
Rum Slim 127.56 KiB 127.59 KiB +31 B +0.02%
Worker 22.99 KiB 22.99 KiB 0 B 0.00%
🚀 CPU Performance
Action Name Base CPU Time (ms) Local CPU Time (ms) 𝚫%
RUM - add global context 0.0039 0.0017 -56.41%
RUM - add action 0.0211 0.0115 -45.50%
RUM - add error 0.02 0.0115 -42.50%
RUM - add timing 0.001 0.0004 -60.00%
RUM - start view 0.0194 0.0102 -47.42%
RUM - start/stop session replay recording 0.0014 0.0007 -50.00%
Logs - log message 0.0342 0.0179 -47.66%
🧠 Memory Performance
Action Name Base Memory Consumption Local Memory Consumption 𝚫
RUM - add global context 43.46 KiB 38.36 KiB -5.10 KiB
RUM - add action 67.59 KiB 63.50 KiB -4.09 KiB
RUM - add timing 42.64 KiB 37.51 KiB -5.14 KiB
RUM - add error 75.08 KiB 70.44 KiB -4.64 KiB
RUM - start/stop session replay recording 49.95 KiB 41.42 KiB -8.54 KiB
RUM - start view 497.36 KiB 482.95 KiB -14.41 KiB
Logs - log message 59.14 KiB 54.79 KiB -4.35 KiB

🔗 RealWorld

@datadog-official
Copy link
Copy Markdown

datadog-official Bot commented May 7, 2026

Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 83.33%
Overall Coverage: 77.48% (+0.01%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 7636e78 | Docs | Datadog PR Page | Give us feedback!

@BenoitZugmeyer BenoitZugmeyer marked this pull request as ready for review May 11, 2026 08:42
@BenoitZugmeyer BenoitZugmeyer requested a review from a team as a code owner May 11, 2026 08:42
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b645dabc27

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +64 to +66
setTimeout(
() => handleResource(() => assembleResource(entry, requestRegistry.getMatchingRequest(entry), configuration)),
REQUEST_MATCHING_DELAY
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Cancel delayed matching timers on stop

For request-type entries this timer can still fire after startResourceCollection().stop() has been called; taskQueue.stop() only clears work already queued, so the delayed callback will call handleResource() later and push a new task that can emit a stale resource event after the collection was stopped, especially during SDK teardown/re-init with in-flight fetch/XHR entries. Store and clear these timeout ids, or guard the callback with a stopped flag.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@mormubis mormubis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! One thing: we should clear the pending setTimeout timers in stop() so a delayed callback doesn't push a stale task after teardown. The batch stops before resourceCollection.stop(), so the assembly pipeline is still listening when the timer fires. What do you think?

@BenoitZugmeyer
Copy link
Copy Markdown
Member Author

/merge

@gh-worker-devflow-routing-ef8351
Copy link
Copy Markdown

gh-worker-devflow-routing-ef8351 Bot commented May 11, 2026

View all feedbacks in Devflow UI.

2026-05-11 09:51:18 UTC ℹ️ Start processing command /merge


2026-05-11 09:51:21 UTC ℹ️ MergeQueue: Stack detected — automatically merging the entire stack.

@BenoitZugmeyer
Copy link
Copy Markdown
Member Author

/merge -c

@gh-worker-devflow-routing-ef8351
Copy link
Copy Markdown

View all feedbacks in Devflow UI.

2026-05-11 10:40:13 UTC ℹ️ Start processing command /merge -c
If you need support, contact us on Slack #devflow!


2026-05-11 10:40:14 UTCDevflow: /merge -c

This merge request is not in the queue and can't be unqueued

To get help about command usage, write /merge --help

If you need support, contact us on Slack #devflow with those details!

@BenoitZugmeyer BenoitZugmeyer merged commit eceea82 into v7 May 11, 2026
17 of 18 checks passed
@BenoitZugmeyer BenoitZugmeyer deleted the benoit/debug-no-bs-firefox branch May 11, 2026 10:41
@github-actions github-actions Bot locked and limited conversation to collaborators May 11, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants